Skip to content

fix: count only real NaN/Inf events in training monitor#20

Open
ogulcanaydogan wants to merge 7 commits into
mainfrom
fix/monitor-real-nan-count
Open

fix: count only real NaN/Inf events in training monitor#20
ogulcanaydogan wants to merge 7 commits into
mainfrom
fix/monitor-real-nan-count

Conversation

@ogulcanaydogan
Copy link
Copy Markdown
Owner

Summary: fix false-positive nan_count in A100 monitor by counting only metric NaN/Inf values and explicit nan guard events. Validation: bash -n script passed; restarted monitor service on A100 and observed nan_count=0 during active training.

Copilot AI review requested due to automatic review settings February 24, 2026 07:32
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to eliminate false-positive NaN/Inf monitoring during long-running training (notably on A100) by adding an explicit NaN/Inf guard callback, refining how monitoring scripts count “real” NaN/Inf events, and introducing supporting operational tooling (systemd units + post-run utilities).

Changes:

  • Add NaNGuardCallback (and tests) to stop training after consecutive non-finite metric detections.
  • Wire training callbacks + max_grad_norm into the training orchestrator and config.
  • Add remote-ops scripts/services: systemd units, watchdog, monitor, manifest generator, and post-training pipeline helpers.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_callbacks.py Adds unit tests for NaNGuardCallback behavior (stop + reset).
src/forge/utils/config.py Adds training.max_grad_norm to the pydantic config model.
src/forge/training/trainer.py Wires max_grad_norm and adds early-stopping + NaN guard callbacks to the trainer.
src/forge/training/callbacks.py Introduces NaNGuardCallback + non-finite detection helper and formalizes callback base class.
scripts/training_watchdog.py Adds a watchdog that restarts systemd training service on repeated NaN lines / stalls.
scripts/start_a100_training.sh Adds a systemd-friendly entrypoint for A100 training runs (resume support, cache env).
scripts/set_wandb_key.sh Helper to safely write WANDB_API_KEY to a user env file and restart services.
scripts/post_training_pipeline.sh Adds a post-run pipeline (manifest → evals → merge → optional serve checks).
scripts/monitor_a100_training.sh Adds a progress/ETA/NaN counter monitor that writes status files.
scripts/install_training_services.sh Installs/enables systemd user services and bootstraps env file.
scripts/generate_training_manifest.py Adds deterministic manifest generator for completed training runs.
docs/TRAINING_GUIDE.md Links to the new roadmap document.
docs/ROADMAP.md Adds a structured post-run roadmap and definition-of-done.
deploy/systemd/forge-training.service Adds training systemd user unit.
deploy/systemd/forge-training-watchdog.service Adds watchdog systemd user unit.
deploy/systemd/forge-training-monitor.service Adds monitor systemd user unit.
configs/models/turkcell_7b_a100_v4_recovery.yaml Adds A100 recovery profile (incl. max_grad_norm).
configs/models/turkcell_7b_a100_v3_clean.yaml Adds A100 “stable” profile (incl. max_grad_norm).
configs/models/turkcell_7b_a100_resume_bf16_clean.yaml Adds A100 resume bf16 profile.
README.md Adds a short “Post-Completion Roadmap” section linking to docs/ROADMAP.md.
Makefile Adds manifest, post-train, set-wandb targets.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

#!/usr/bin/env bash
set -euo pipefail

cd /home/weezboo/projects/LowResource-LLM-Forge
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cd /home/weezboo/... hard-codes a user-specific path and will break when the repo lives elsewhere (even though the systemd unit already sets WorkingDirectory). Consider switching to a configurable PROJECT_ROOT (as in other scripts) or using the script directory to resolve paths.

Suggested change
cd /home/weezboo/projects/LowResource-LLM-Forge
PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
cd "$PROJECT_ROOT"

Copilot uses AI. Check for mistakes.
Comment on lines +54 to +60
# Count only real NaN/Inf metric values and explicit NaN guard events.
metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"
guard_nan_count="$(grep -a -E -c "nan_guard_detected|nan_guard_stopping_training" "$LOG_FILE" || true)"
metric_nan_count="${metric_nan_count:-0}"
guard_nan_count="${guard_nan_count:-0}"
nan_count=$((metric_nan_count + guard_nan_count))
fi
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nan_count will likely double-count NaN/Inf events: metric_nan_count matches any occurrence of '(loss|grad_norm|eval_loss)': ... (nan|inf) anywhere in the log, including inside bad_metrics={...} on nan_guard_detected lines, and then guard_nan_count counts those same guard lines again. To keep the counter accurate, exclude nan_guard_* lines from the metric grep (or tighten the metric-line pattern so it only matches the trainer’s metric dict lines) before adding guard_nan_count.

Copilot uses AI. Check for mistakes.
Comment on lines +54 to +55
# Count only real NaN/Inf metric values and explicit NaN guard events.
metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric grep only matches nan|inf but not +inf / -inf, while the Python-side _is_non_finite treats those as non-finite too. If you want monitor parity with the callback, extend the pattern to include signed inf variants.

Suggested change
# Count only real NaN/Inf metric values and explicit NaN guard events.
metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"
# Count only real NaN/Inf metric values (including +inf/-inf) and explicit NaN guard events.
metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|[+-]?inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"

Copilot uses AI. Check for mistakes.
Comment on lines +178 to +183
if current_metric_hash and current_metric_hash != state.last_metric_hash:
state.last_metric_hash = current_metric_hash
if "nan" in last_metric.lower():
state.nan_consecutive += 1
else:
state.nan_consecutive = 0
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The watchdog increments nan_consecutive only when 'nan' in last_metric.lower(). Since the monitor/callback logic treats Inf as non-finite too, this misses inf, +inf, -inf events and could fail to restart on an Inf blow-up. Consider checking for both NaN and Inf (ideally with the same regex used elsewhere).

Copilot uses AI. Check for mistakes.
Comment thread Makefile
Comment on lines +85 to +93
manifest: ## Generate training manifest (TRAIN_CONFIG/RUN_DIR/LOG_FILE)
@if [ -z "$(RUN_DIR)" ] || [ -z "$(LOG_FILE)" ]; then \
echo "Usage: make manifest TRAIN_CONFIG=<yaml> RUN_DIR=<run_dir> LOG_FILE=<train_log>"; \
exit 1; \
fi
uv run python scripts/generate_training_manifest.py \
--config "$(TRAIN_CONFIG)" \
--run-dir "$(RUN_DIR)" \
--log-file "$(LOG_FILE)"
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make manifest uses $(TRAIN_CONFIG) but the argument validation only checks RUN_DIR and LOG_FILE. If a user clears TRAIN_CONFIG (or runs with an empty env var), this will fail later with a less actionable argparse error. Consider validating TRAIN_CONFIG in the same check (or defaulting it explicitly for this target).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants